06. Policy Gradient Quiz

Suppose we are training an agent to play a computer game. There are only two possible actions:

0 = Do nothing,
1 = Move

There are three time-steps in each game, and our policy is completely determined by one parameter \theta, such that the probability of "moving" is \theta and the probability of doing nothing is 1-\theta.

Initially \theta=0.5. Three games are played, with the following results:
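
From the policy definition above, the gradient of the log-probability of each action (the score function) is:

d/d\theta log \pi(1|\theta) = 1/\theta
d/d\theta log \pi(0|\theta) = -1/(1-\theta)

At \theta = 0.5 these evaluate to +2 for "Move" and -2 for "Do nothing"; these two factors are all that is needed for the gradient questions below.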

Game 1:
actions: (1,0,1)
rewards: (1,0,1)

Game 2:
actions: (1,0,0)
rewards: (0,0,1)

Game 3:
actions: (0,1,0)
rewards: (1,0,1)

Computing the policy gradient

What are the future rewards for the first game?

Recall the results for game 1 are:

actions: (1,0,1)
rewards: (1,0,1)

SOLUTION: (2,1,1)
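
Here the "future reward" at step t is the sum of the rewards from step t to the end of the game (the reward-to-go), so for game 1:

(1+0+1, 0+1, 1) = (2, 1, 1)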

What is the policy gradient computed from the second game, using future rewards?

actions: (1,0,0)
rewards: (0,0,1)

SOLUTION: -2
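
Working this out: the future rewards for game 2 are (0+0+1, 0+1, 1) = (1, 1, 1). Weighting each step's score by its future reward, the "move" at step 1 contributes +2*1 and the two "do nothing" actions contribute -2*1 each, so the gradient is 2 - 2 - 2 = -2.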

Which of these statements are true regarding the third game? Recall that the results for the third game are:

actions: (0,1,0)
rewards: (1,0,1)

SOLUTION:
  • The contribution to the gradient from the second and third steps cancel each other
  • The computed policy gradient from this game is negative
  • Using the total reward vs. the future reward gives the same policy gradient in this game
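
Working through game 3: the future rewards are (2, 1, 1), so the per-step contributions are -2*2, +2*1, and -2*1, which sum to -4 (negative), with steps 2 and 3 cancelling each other. Using the total reward of 2 at every step instead gives -2*2 + 2*2 - 2*2 = -4 as well.

Below is a minimal Python sketch that reproduces all of these numbers; the helper names reward_to_go, score, and game_gradient are purely illustrative, not from any library.

# Policy: P(move) = theta, P(do nothing) = 1 - theta.
# Score function: d/dtheta log pi(a|theta) = 1/theta if a == 1, else -1/(1 - theta).

theta = 0.5

games = [
    {"actions": (1, 0, 1), "rewards": (1, 0, 1)},  # game 1
    {"actions": (1, 0, 0), "rewards": (0, 0, 1)},  # game 2
    {"actions": (0, 1, 0), "rewards": (1, 0, 1)},  # game 3
]

def reward_to_go(rewards):
    # Future reward at step t: sum of rewards from step t to the end of the game.
    return [sum(rewards[t:]) for t in range(len(rewards))]

def score(action):
    # Gradient of log pi(action | theta) for this one-parameter policy.
    return 1.0 / theta if action == 1 else -1.0 / (1.0 - theta)

def game_gradient(actions, rewards, use_future_rewards=True):
    # Policy gradient estimate from a single game.
    if use_future_rewards:
        weights = reward_to_go(rewards)
    else:
        weights = [sum(rewards)] * len(rewards)  # same total reward at every step
    return sum(score(a) * w for a, w in zip(actions, weights))

print(reward_to_go(games[0]["rewards"]))                        # [2, 1, 1]
print(game_gradient(games[1]["actions"], games[1]["rewards"]))  # -2.0
print(game_gradient(games[2]["actions"], games[2]["rewards"]))  # -4.0 (negative)
print(game_gradient(games[2]["actions"], games[2]["rewards"],
                    use_future_rewards=False))                  # -4.0 with total reward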